Lecture 2: Data Frame, Matrix, List

Abhijit Dasgupta

September 13, 2016

Preamble

Practice makes perfect

R packages

R packages live on CRAN and its mirrors. To install an R package:

install.packages('dplyr', repos='http://cran.r-project.org')

To use the functions from a package, you first have to load it into R:

library(dplyr)
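
As a small illustration of the difference between an installed package and a loaded one, the sketch below uses stats, a package that ships with R, so nothing has to be installed first. The variable names are made up for this example; the `pkg::function` form shown here is an alternative that calls one function without loading the whole package.

```r
# 'stats' is installed with R itself, so this runs on any installation.
# requireNamespace() reports whether a package is installed without attaching it.
stats_available <- requireNamespace("stats", quietly = TRUE)

# pkg::fun calls a single function from a package without library()
median_via_namespace <- stats::median(c(3, 1, 2))

# library() attaches the package so its functions can be called by bare name
library(stats)
median_after_library <- median(c(3, 1, 2))

median_via_namespace == median_after_library
## [1] TRUE
```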

We’ll talk about packages later in the semester.

For now we will concentrate on what is known as Base R, that is, the functions that are available as soon as R is installed.

Loading data

We will usually load CSV files, since they are the easiest format for R to read. The typical suggestion if you have Excel data is to save the sheet as a CSV and then import it into R.

You can also load Excel files directly using either the readxl or rio packages
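
A minimal sketch of the CSV route, assuming nothing beyond Base R: it writes a small made-up table to a temporary CSV file and reads it back, which is the same step you would perform on a sheet saved from Excel. The names `toy_data` and `csv_path` are invented for this illustration.

```r
# Build a small table and save it as a CSV in a temporary location
toy_data <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
csv_path <- tempfile(fileext = ".csv")
write.csv(toy_data, csv_path, row.names = FALSE)

# Read it back, as you would with a sheet exported from Excel
loaded <- read.csv(csv_path)
head(loaded)  # first few rows
dim(loaded)   # number of rows, then number of columns
```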

The structure of data sets

Tables

  • Data is typically in a rectangular format

    • spreadsheet, database table
    • CSV (comma-separated values) or TSV (tab-separated values) files
  • Characteristics

    • Rows are observations
    • Columns are variables
    • Each column has the same number of observations
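
The rectangular shape can be seen in a small sketch. The data below are made up, and `patients` is a hypothetical name; the second call deliberately supplies columns of different lengths to show that R refuses to build a non-rectangular table.

```r
# Each argument becomes a column (variable); each position is a row (observation)
patients <- data.frame(
  age   = c(34, 51, 47),
  group = c("Normal", "Abnormal", "Normal")
)
nrow(patients)  # 3 observations
ncol(patients)  # 2 variables

# Columns of length 3 and 2 cannot form a rectangle, so data.frame() stops
# with an error; tryCatch() captures it here instead of halting the script.
result <- tryCatch(
  data.frame(age = c(34, 51, 47), group = c("Normal", "Abnormal")),
  error = function(e) "error: columns have different lengths"
)
result
```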

Tidy data is a particularly amenable format for data analysis.

An example GEO dataset

Lower back pain symptoms dataset on Kaggle.com

Breast Cancer Proteome dataset on Kaggle.com

Let’s look at a dataset

data_spine <- read.csv('lecture2_data/Dataset_spine.csv')
head(data_spine)
##   Pelvic.incidence Pelvic.tilt Lumbar.lordosis.angle Sacral.slope
## 1         63.02782   22.552586              39.60912     40.47523
## 2         39.05695   10.060991              25.01538     28.99596
## 3         68.83202   22.218482              50.09219     46.61354
## 4         69.29701   24.652878              44.31124     44.64413
## 5         49.71286    9.652075              28.31741     40.06078
## 6         40.25020   13.921907              25.12495     26.32829
##   Pelvic.radius Degree.spondylolisthesis Pelvic.slope Direct.tilt
## 1      98.67292                -0.254400    0.7445035     12.5661
## 2     114.40543                 4.564259    0.4151857     12.8874
## 3     105.98514                -3.530317    0.4748892     26.8343
## 4     101.86850                11.211523    0.3693453     23.5603
## 5     108.16872                 7.918501    0.5433605     35.4940
## 6     130.32787                 2.230652    0.7899929     29.3230
##   Thoracic.slope Cervical.tilt Sacrum.angle Scoliosis.slope
## 1        14.5386      15.30468   -28.658501         43.5123
## 2        17.5323      16.78486   -25.530607         16.1102
## 3        17.4861      16.65897   -29.031888         19.2221
## 4        12.7074      11.42447   -30.470246         18.8329
## 5        15.9546       8.87237   -16.378376         24.9171
## 6        12.0036      10.40462    -1.512209          9.6548
##   Class.attribute
## 1        Abnormal
## 2        Abnormal
## 3        Abnormal
## 4        Abnormal
## 5        Abnormal
## 6        Abnormal

Ignore the ## at the start of each output line; it only marks the line as R output.

By default, read.csv:

  • Assumes that the first row has variable names
  • Replaces spaces in variable names with .
  • Keeps numeric and character variables together
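
These defaults can be checked on a tiny example. The sketch below passes a made-up two-row CSV as a string through the `text` argument of read.csv, so no file is needed; `csv_text` and `mini` are names invented for the illustration.

```r
# text= lets read.csv() parse a string instead of a file
csv_text <- "Pelvic incidence,Class attribute
63.03,Abnormal
39.06,Normal"

mini <- read.csv(text = csv_text)

names(mini)               # header row became names; spaces became dots
## [1] "Pelvic.incidence" "Class.attribute"
sapply(mini, is.numeric)  # one numeric column and one text column, side by side

# check.names = FALSE keeps the header text exactly as written
names(read.csv(text = csv_text, check.names = FALSE))
## [1] "Pelvic incidence" "Class attribute"
```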

View(data_spine)  # It looks like a matrix

str(data_spine) # Structure of a dataset
## 'data.frame':    310 obs. of  13 variables:
##  $ Pelvic.incidence        : num  63 39.1 68.8 69.3 49.7 ...
##  $ Pelvic.tilt             : num  22.55 10.06 22.22 24.65 9.65 ...
##  $ Lumbar.lordosis.angle   : num  39.6 25 50.1 44.3 28.3 ...
##  $ Sacral.slope            : num  40.5 29 46.6 44.6 40.1 ...
##  $ Pelvic.radius           : num  98.7 114.4 106 101.9 108.2 ...
##  $ Degree.spondylolisthesis: num  -0.254 4.564 -3.53 11.212 7.919 ...
##  $ Pelvic.slope            : num  0.745 0.415 0.475 0.369 0.543 ...
##  $ Direct.tilt             : num  12.6 12.9 26.8 23.6 35.5 ...
##  $ Thoracic.slope          : num  14.5 17.5 17.5 12.7 16 ...
##  $ Cervical.tilt           : num  15.3 16.78 16.66 11.42 8.87 ...
##  $ Sacrum.angle            : num  -28.7 -25.5 -29 -30.5 -16.4 ...
##  $ Scoliosis.slope         : num  43.5 16.1 19.2 18.8 24.9 ...
##  $ Class.attribute         : Factor w/ 2 levels "Abnormal","Normal": 1 1 1 1 1 1 1 1 1 1 ...

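So this is a data.frame object with 310 observations and 13 variables, of which one is a factor and the rest are numeric. (The factor comes from read.csv converting text columns to factors, which was the default before R 4.0.0; later versions keep them as character unless stringsAsFactors = TRUE is passed.)

The str() output also reads like a list, with one named entry per variable.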

Dataframes

Dataframes are the primary mode of storing datasets in R

They were revolutionary in that they kept heterogeneous data together

They share properties of both a matrix and a list

class(data_spine)
## [1] "data.frame"

Technically, a data.frame is a list of vectors (or objects, generally) of the same length
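
Both sides of this dual nature can be tried out directly. The sketch below uses a tiny made-up data frame (`df` is a stand-in, not the spine data) and shows list-style and matrix-style access on the same object.

```r
# A tiny made-up data frame standing in for data_spine
df <- data.frame(
  angle = c(39.6, 25.0, 50.1),
  class = c("Abnormal", "Abnormal", "Normal")
)

# List-like behaviour: one element per column
is.list(df)         # TRUE
length(df)          # 2, the number of columns
df$angle            # extract a column by name
df[["class"]]       # same idea with list-style double brackets
sapply(df, length)  # every column has the same length

# Matrix-like behaviour: rows and columns
dim(df)             # 3 rows, 2 columns
df[2, "angle"]      # row 2 of the column named "angle"
```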

Matrices

A matrix is a rectangular array of data of the same type

matrix(0, nrow=2, ncol=4)
##      [,1] [,2] [,3] [,4]
## [1,]    0    0    0    0
## [2,]    0    0    0    0
matrix(letters, nrow=2)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] "a"  "c"  "e"  "g"  "i"  "k"  "m"  "o"  "q"  "s"   "u"   "w"   "y"  
## [2,] "b"  "d"  "f"  "h"  "j"  "l"  "n"  "p"  "r"  "t"   "v"   "x"   "z"
matrix(letters, nrow=2, byrow=TRUE)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] "a"  "b"  "c"  "d"  "e"  "f"  "g"  "h"  "i"  "j"   "k"   "l"   "m"  
## [2,] "n"  "o"  "p"  "q"  "r"  "s"  "t"  "u"  "v"  "w"   "x"   "y"   "z"
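
The "same type" requirement has a practical consequence, sketched below with made-up values: if numbers and text are combined into one matrix, R converts everything to text, whereas a data frame keeps each column's own type. The names `mixed` and `kept` are only for this illustration.

```r
# Mixing numbers and text in one matrix forces everything to character
mixed <- cbind(c(1, 2), c("a", "b"))
mixed
typeof(mixed)   # "character": the numbers were converted to text

# A data frame keeps each column's own type
kept <- data.frame(num = c(1, 2), chr = c("a", "b"))
sapply(kept, is.numeric)
```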

You can create a matrix from a set of vectors of the same length

x <- c(1,2,3,4)
y <- c(10,20,30,40)

Put columns together

cbind(x,y) # Column bind
##      x  y
## [1,] 1 10
## [2,] 2 20
## [3,] 3 30
## [4,] 4 40

Put rows together

rbind(x,y) # Row bind
##   [,1] [,2] [,3] [,4]
## x    1    2    3    4
## y   10   20   30   40
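
Column binding and row binding produce the same numbers in two orientations. The short sketch below compares the two using t(), the Base R transpose function; `by_col` and `by_row` are names chosen for the illustration.

```r
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)

by_col <- cbind(x, y)  # 4 rows, 2 columns
by_row <- rbind(x, y)  # 2 rows, 4 columns

dim(by_col)
## [1] 4 2
dim(by_row)
## [1] 2 4

# t() transposes a matrix, swapping rows and columns
all(t(by_col) == by_row)
## [1] TRUE
```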

Extracting elements

example_matrix <- rbind(x,y)
example_matrix
##   [,1] [,2] [,3] [,4]
## x    1    2    3    4
## y   10   20   30   40
example_matrix[1,] # Extracts 1st row
## [1] 1 2 3 4
example_matrix[,2] # extracts 2nd column as a vector, prints horizontally
##  x  y 
##  2 20
example_matrix[1,4]
## x 
## 4
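
A few more indexing forms, sketched on the same matrix: selecting by row name, selecting several columns, excluding a column with a negative index, and using `drop = FALSE` to keep a single row as a matrix rather than a plain vector. The name `one_row` is only for this illustration.

```r
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)
example_matrix <- rbind(x, y)

example_matrix["y", ]       # select a row by its name
example_matrix[, c(1, 3)]   # columns 1 and 3: the result is still a matrix
example_matrix[, -1]        # everything except the first column

# drop = FALSE keeps the result as a 1 x 4 matrix instead of a vector
one_row <- example_matrix[1, , drop = FALSE]
dim(one_row)
## [1] 1 4
```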

Matrix properties

example_matrix
##   [,1] [,2] [,3] [,4]
## x    1    2    3    4
## y   10   20   30   40
nrow(example_matrix) # Number of rows
## [1] 2
ncol(example_matrix) # Number of columns
## [1] 4
dim(example_matrix) # Both at once: rows, then columns
## [1] 2 4
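
A brief sketch of how these properties relate to each other on the same matrix, together with the row and column names that rbind() did or did not create.

```r
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)
example_matrix <- rbind(x, y)

dim(example_matrix)[1] == nrow(example_matrix)  # first entry of dim() is the row count
dim(example_matrix)[2] == ncol(example_matrix)  # second entry is the column count
length(example_matrix)    # total number of elements: 2 * 4 = 8
rownames(example_matrix)  # "x" "y", taken from the vector names by rbind()
colnames(example_matrix)  # NULL: no column names were set
```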

Matrix arithmetic

example_matrix
##   [,1] [,2] [,3] [,4]
## x    1    2    3    4
## y   10   20   30   40
example_matrix + 5 # Add 5 to each element
##   [,1] [,2] [,3] [,4]
## x    6    7    8    9
## y   15   25   35   45
example_matrix * 2 # Multiply each element by 2
##   [,1] [,2] [,3] [,4]
## x    2    4    6    8
## y   20   40   60   80
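
The operators above work element by element, and the same holds when both operands are matrices. As a hedged aside, the sketch below contrasts that with `%*%`, the Base R operator for the matrix product of linear algebra; `elementwise` and `product` are names made up for the illustration.

```r
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)
example_matrix <- rbind(x, y)

# Element by element: each entry is multiplied by the entry in the same position
elementwise <- example_matrix * example_matrix
elementwise["x", ]
## [1]  1  4  9 16

# Matrix product: (2 x 4) times (4 x 2) gives a 2 x 2 matrix
product <- example_matrix %*% t(example_matrix)
dim(product)
## [1] 2 2
product["x", "y"]  # 1*10 + 2*20 + 3*30 + 4*40 = 300
```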